THE PATENTS ACT, 1970

(39 of 1970)
APPLICATION FOR GRANT OF A PATENT
(See Sections 5(2), 7, 54 and 135)



03 DEC 2002



STMicroelectronics Pvt. Ltd., an Indian company, of Plot No. 2 & 3,
Sector 16A, Institutional Area, Noida - 201 301, Uttar Pradesh, India,

hereby declare -



(a) that I am/we are in possession of an invention titled "Linear Scalable
FFT/IFFT Computation In A Multi-Processor System."

(b) that the provisional / complete specification relating to this invention is
filed with this application 

(c) that there is no lawful ground of objection to the grant of a patent to 
me/us. 

I/we further declare that the inventor(s) for the said invention is/are

(i) SAHA Kaushik, an Indian citizen, of A2, Staff Flats, I.P. College,
Shamnath Marg, Delhi - 110 054, India.

(ii) MAITI Srijib Narayan, an Indian citizen, of R-5/1, Dak Bungalow
Road, Saratpalli, Midnapore - 721101, W.B., India.

I/we claim the priority from the application(s) filed in convention countries,
particulars of which are as follows: NA

I/we state that the said invention is an improvement in or modification of the 
invention the particulars of which are as follows and of which I/we are the 
applicant/patentee: NIL 

I/we state that the application is divided out of my/our application, the particulars
of which are given below, and pray that this application be deemed to have been
filed on under section 16 of the Act. NIL

That I am/we are the assignee or legal representative of the true and first
inventors.

That my/our address for service in India is as follows: 

ANAND & ANAND, Advocates
B-41, Nizamuddin East
New Delhi - 110 013

Tel Nos.: (11) 4355078, 4355076, 4350360
Fax Nos.: (11) 4354243, 4352060

a) Kaushik Saha, an Indian National, of A2, Staff Flats, I.P. College, Shamnath
Marg, Delhi - 110054, India.



Signature 

Dated this [illegible] day of 2002



b) Srijib Narayan Maiti, an Indian National, of R-5/1, Dak Bungalow Road,
Saratpalli, Midnapore - 721101, W.B., India.



Signature 



Dated this [illegible] day of 2002

[illegible] grant of patent to

Following are the attachments with the application

(a) Complete specification (3 copies) 

(b) Abstract 

(c) Formal drawings 

(d) Power of Attorney 

(e) Form 1 (in triplicate) 

(f) Form 3 (in duplicate)

(g) Fee Rs. 5000/- in cash/cheque/bank draft bearing no. , date

Bank.

I/We request that a patent may be granted to me/us for the said invention.
Dated this [illegible] day of 2002

STMicroelectronics Pvt. Limited

To
The Controller of Patents
The Patent Office, Delhi

COMPLETE SPECIFICATION



[See Section 10] 



'LINEAR SCALABLE FFT/IFFT COMPUTATION IN A MULTI-PROCESSOR SYSTEM'

STMicroelectronics Pvt. Ltd., Plot No. 2 & 3, Sector 16A, Institutional Area, Noida - 201 301,
Uttar Pradesh, India, an Indian Company



The following specification particularly describes and ascertains the nature of this invention and
the manner in which it is to be performed.

LINEAR SCALABLE FFT/IFFT COMPUTATION IN A MULTI-PROCESSOR SYSTEM

Field of the invention:

The present invention relates to the field of digital signal processing. More particularly the 
invention relates to linearly scalable FFT/IFFT computation in a multiprocessor system. 

Background of the invention: 

The class of Fourier transforms that apply to signals that are discrete and periodic in nature is
known as Discrete Fourier Transforms (DFT). The DFT plays a key
role in digital signal processing in areas such as spectral analysis, frequency domain filtering and
polyphase transformations.

The DFT of a sequence of length N can be decomposed into successively smaller DFTs. The
manner in which this principle is implemented falls into two classes. The first class is called
"decimation in time" and the second "decimation in frequency". The first derives its name
from the fact that, in the process of arranging the computation into smaller transformations, the
sequence x(n) (the index 'n' is often associated with time) is decomposed into successively
smaller subsequences. In the second general class the sequence of DFT coefficients X(k) is
decomposed into smaller subsequences (k denoting frequency). The present invention employs
"decimation in time".

Since the amount of storing and processing of data in numerical computation algorithms is 
proportional to the number of arithmetic operations, it is generally accepted that a meaningful 
measure of complexity, or of the time required to implement a computational algorithm, is the 
number of multiplications and additions required. The direct computation of the DFT requires
4N^2 real multiplications and N(4N-2) real additions. Since the amount of computation, and thus
the computation time, is approximately proportional to N^2, it is evident that the number of
arithmetic operations required to compute the DFT by the direct method becomes very large for
large values of N. For this reason, computational procedures that reduce the number of
multiplications and additions are of considerable interest. The Fast Fourier Transform (FFT) is
an efficient algorithm for computing the DFT.
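
The operation counts quoted above can be checked with a short sketch. It is illustrative only and not part of the specification: the function names are hypothetical, and the forward-transform sign convention exp(-j*2*pi*k*n/N) is an assumption. The count treats each complex multiply as 4 real multiplies and 2 real adds, and each complex add as 2 real adds.

```python
import cmath


def direct_dft(x):
    """Direct O(N^2) DFT: X[k] = sum over n of x[n] * exp(-2j*pi*k*n/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def direct_dft_real_ops(n_points):
    """Real multiplications and additions needed by the direct method."""
    # N*N complex multiplies: 4 real multiplies and 2 real adds each.
    real_mults = 4 * n_points ** 2
    # N*(N-1) complex adds: 2 real adds each, plus the adds inside the multiplies.
    real_adds = 2 * n_points ** 2 + 2 * n_points * (n_points - 1)
    return real_mults, real_adds
```

For N = 8 this gives 256 real multiplications and 240 real additions, matching 4N^2 and N(4N-2).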

The conventional method of implementing an FFT or Inverse Fast Fourier Transform (IFFT) uses a
radix-2 / radix-4 / mixed-radix approach with either a "decimation in time (DIT)" or a
"decimation in frequency (DIF)" approach.

The basic computational block is called a "butterfly", a name derived from the appearance of the
flow of the computations involved in it. Fig. 1 shows a typical radix-2 butterfly computation. 1.1
represents the 2 inputs (referred to as the "odd" and "even" inputs) of the butterfly and 1.2 refers
to the 2 outputs. One of the inputs (in this case the odd input) is multiplied by a complex quantity
called the twiddle factor (W_N^k). The general equations describing the relationship between inputs
and outputs are as follows:

X[k] = x[n] + x[n+N/2] W_N^k

X[k+N/2] = x[n] - x[n+N/2] W_N^k
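
As an illustration only, the two butterfly equations can be sketched as follows. The function names are hypothetical, and the forward-transform definition W_N^k = exp(-j*2*pi*k/N) is an assumption; an IFFT would use the conjugate twiddle factor.

```python
import cmath


def twiddle(n_points, k):
    """Twiddle factor W_N^k, assuming the forward-transform sign convention."""
    return cmath.exp(-2j * cmath.pi * k / n_points)


def radix2_butterfly(even_in, odd_in, w):
    """One radix-2 butterfly: a single complex multiply feeds both outputs."""
    t = odd_in * w            # the odd input is scaled by the twiddle factor
    return even_in + t, even_in - t
```

With inputs 3 and 5 and W_2^0 = 1, the butterfly returns 8 and -2, which is the 2-point DFT of the pair.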

An FFT butterfly calculation is implemented by a z-point data operation wherein "z" is referred
to as the "radix". An "N" point FFT employs N/z butterfly units per stage (block) for log_z N
stages. The result of one butterfly stage is applied as an input to one or more subsequent butterfly
stages.
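
For radix 2, the conventional arrangement of N/2 butterflies per stage over log2 N stages can be sketched with three nested loops (stages, groups within a stage, butterflies within a group). This is a generic textbook form given for illustration, not text from the specification; the bit-reversed input ordering, the function names and the twiddle sign convention are assumptions.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_nested_loops(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    stages = n.bit_length() - 1
    a = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    butterflies = 0
    for s in range(stages):                       # loop 1: stages
        half = 1 << s
        for group in range(0, n, 2 * half):       # loop 2: groups in the stage
            for j in range(half):                 # loop 3: butterflies in the group
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                t = a[group + j + half] * w
                a[group + j + half] = a[group + j] - t
                a[group + j] = a[group + j] + t
                butterflies += 1
    return a, butterflies
```

For an 8-point input the sketch performs 12 butterflies, that is 8/2 butterflies in each of log2 8 = 3 stages.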

Computational complexity for an N-point FFT calculation using the radix-2 approach =
O(N/2 * log2 N) where N is the length of the transform. There are exactly N/2 * log2 N
butterfly computations, each comprising 3 complex loads, 1 complex multiply, 2 
complex adds and 2 complex stores. A full radix-4 implementation on the other hand 
requires several complex load/store operations. Since only 1 store operation and 1 load 
operation are allowed per bundle of a typical VLIW processor, cycles are wasted in 
doing only load/store operations, thus reducing ILP (Instruction Level parallelism). The 
conventional nested loop approach requires a high looping overhead on the processor. 
It also makes application of standard optimization methods difficult. Due to the nature
of the data dependencies of the conventional FFT/IFFT, [illegible] large. It is therefore necessary in many applications requiring high-speed or real-time [illegible].

US patent [number illegible] describes a multiprocessor approach [illegible] a pipelined [illegible] wherein [illegible] preceding processor in order to perform its share of work. [illegible]

US patent 5,293,330 describes a pipelined processor for mixed-size FFT. Here too, the [illegible] does not provide linear scalability in throughput, as it is pipelined. [illegible] required because [illegible] implementation, inter-processor communication is [illegible] because subsequent computations on one processor depend on intermediate results from [illegible].

Summary of the invention:

The object of the present invention is to overcome the above drawbacks and provide linear
scalability of throughput in a multiprocessor system.

To achieve the aforementioned objective, the present invention provides a modified arrangement 
for enabling the parallel computation of different butterflies in different processors. 

The invention provides a linear scalable method for computing a Fast Fourier Transform (FFT)
or Inverse Fast Fourier transform (IFFT) in a multiprocessing system using a Decimation in
Time approach, comprising the steps of:

computing the first two stages of an N-point FFT/IFFT as a single radix-4
butterfly computation operation while implementing the remaining (log2N-2)
stages as radix-2 operations,

fusing the 3 main nested loops of each radix-2 butterfly stage into a
single radix-2 butterfly computation loop, and

distributing the computation of the butterflies in each stage such that each
processor computes an equal number of complete butterfly calculations thereby
eliminating data interdependency in the stage.

The said distribution of butterfly computation is implemented by assigning the memory location
addresses corresponding to the inputs and outputs required for each specific butterfly
calculation to a selected processor.

The instant invention also provides a linear scalable system for computing a Fast Fourier 
Transform (FFT) or Inverse Fast Fourier transform (IFFT) in a multiprocessing system using a 
Decimation in Time approach, comprising: 

means for computing the first two stages of an N-point FFT/IFFT as a single 
radix-4 butterfly computation operation while implementing the remaining 
(log2N-2) stages as radix-2 operations, 

means for fusing the 3 main nested loops of each radix-2 butterfly stage into a 
single radix-2 butterfly computation loop, and

means for distributing the computation of the butterflies in each stage such that
each processor computes an equal number of complete butterfly calculations 
thereby eliminating data interdependency in the stage. 

The said means for distributing the computation of the butterflies is implemented by means for
assigning the memory location addresses corresponding to the inputs and outputs required for
specific butterfly calculations to the selected processor.

Further, the invention provides a computer program product comprising computer readable
program code stored on a computer readable storage medium embodied therein for computing a
Fast Fourier Transform (FFT) or Inverse Fast Fourier transform (IFFT) in a multiprocessing
system using a Decimation in Time approach, comprising:

computer readable program code means configured for computing the first two
stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation
while implementing the remaining (log2N-2) stages as radix-2 operations,

computer readable program code means configured for fusing the 3 main nested
loops of each radix-2 butterfly stage into a single radix-2 butterfly computation
loop, and

computer readable program code means configured for distributing the
computation of the butterflies in each stage such that each processor computes an
equal number of complete butterfly calculations thereby eliminating data
interdependency in the stage.

The said computer readable program code means configured for distributing the computation of
the butterflies is implemented by computer readable program code means configured for
assigning the memory location addresses corresponding to the inputs and outputs required for
specific butterfly calculations to a selected processor.

Brief Description of the Drawings: 

The present invention will now be explained with reference to the accompanying drawings 
which are given only by way of illustration and are not limiting for the present invention.

Fig 1 shows the basic structure of the signal flow in a radix-2 butterfly computation for a
discrete Fourier transform. 

Fig 2 shows a 2-processor implementation of butterflies for an 8-point FFT, in accordance with 
the present invention. 

Fig 3 shows a 4-processor implementation of butterflies for an 8-point FFT, in accordance with 
the present invention. 

Detailed Description of the Drawings: 

Fig. 1 has already been described in the background to the invention. 

Fig 2 shows the implementation for an 8 point FFT in a 2-processor architecture using the
present invention. Butterflies drawn with non-dashed lines are computed in one processor, and
those drawn with dashed lines in the other. The
computational blocks are represented by '0'. The left side of each computational block is its
input (the time domain samples) while the right side is its output (transformed samples). The 
present invention uses a mixed radix approach with decimation in time. The first two stages of 
the radix-2 FFT/IFFT are computed as a single radix-4 stage. As these stages contain only 
load/stores and add/subtract operations there is no need for multiplication. This leads to reduced 
time for FFT/IFFT computation as compared to that with full radix-2 implementation. The next 
stage has been implemented as a radix-2 stage. The three main nested loops of conventional
implementations have been fused into a single loop which iterates (N/2*(log2N-2))/(number of
processors) times. Each processor is used to compute one butterfly in one loop iteration. Since
there is no data dependency between different butterflies in this algorithm, the computational 
load can be linearly divided among the different processors, leading to the linear scalability. 
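
The scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the input is assumed to be in bit-reversed order, the function names are hypothetical, the processors are simulated by an inner sequential loop, the processor count is assumed to be a power of two that divides N/2, and a synchronisation point between stages is assumed. The first pass merges stages one and two into a radix-4 operation that needs only additions, subtractions and a real/imaginary swap.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def radix4_first_pass(a):
    """Stages 1 and 2 merged; multiplying by -j is done as a swap, not a multiply."""
    for g in range(0, len(a), 4):
        s0, d0 = a[g] + a[g + 1], a[g] - a[g + 1]
        s1, d1 = a[g + 2] + a[g + 3], a[g + 2] - a[g + 3]
        rot = complex(d1.imag, -d1.real)          # equals d1 * (-j)
        a[g], a[g + 2] = s0 + s1, s0 - s1
        a[g + 1], a[g + 3] = d0 + rot, d0 - rot


def fft_fused(x, num_procs=1):
    """FFT with a radix-4 first pass and one fused loop for the radix-2 stages."""
    n = len(x)                                    # power of two, n >= 4
    stages = n.bit_length() - 1
    a = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    radix4_first_pass(a)
    iterations = (n // 2) * (stages - 2) // num_procs
    for it in range(iterations):                  # the single fused loop
        for p in range(num_procs):                # simulated: one butterfly per processor
            m = it * num_procs + p
            stage, j = 2 + m // (n // 2), m % (n // 2)
            half = 1 << stage
            low = j & (half - 1)
            even = ((j >> stage) << (stage + 1)) | low
            odd = even | half
            t = a[odd] * cmath.exp(-2j * cmath.pi * low / (2 * half))
            a[even], a[odd] = a[even] + t, a[even] - t
    return a, iterations
```

For N = 16 the fused loop runs 16, 8 and 4 iterations with 1, 2 and 4 processors respectively, which is the linear reduction the description refers to.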

The mechanism for assigning the butterflies in this manner consists of assigning the memory 
location to a processor such that each processor computes a complete butterfly. To achieve this a 
binary digit is inserted at the appropriate bit location in the address of the memory location for
input/output data for the computation of the butterfly, depending on the stage of the FFT
transformation. 
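
One possible reading of this addressing step is sketched below. The specification says only that a binary digit is inserted at a stage-dependent bit location, so the exact position used here (bit number equal to a zero-based stage index, with data in bit-reversed order) and the function name are assumptions made for illustration.

```python
def butterfly_addresses(j, stage):
    """Return the (even, odd) addresses of butterfly j in a zero-based stage.

    The butterfly index j is split at bit position `stage`; inserting a 0 there
    gives the even input/output address and inserting a 1 gives the odd one.
    """
    low = j & ((1 << stage) - 1)           # bits below the insertion point
    high = j >> stage                      # bits at and above the insertion point
    even = (high << (stage + 1)) | low     # inserted bit = 0
    odd = even | (1 << stage)              # inserted bit = 1
    return even, odd
```

For an 8-point transform, stage 2 yields the address pairs (0, 4), (1, 5), (2, 6) and (3, 7). Every address appears in exactly one butterfly per stage, so a processor that owns a butterfly index owns both of its memory locations.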

Fig 3 shows a 4-processor implementation for the 8-point FFT using this invention. Different 
line styles represent computation in each of the 4 processors. 
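
The equal division of butterflies among processors can be illustrated with a small sketch. The round-robin assignment and the function names are assumptions; the actual assignment drawn in the figures is not recoverable from the text.

```python
def assign_butterflies(n_points, num_procs):
    """Map each processor to the butterfly indices it computes in one stage."""
    plan = {p: [] for p in range(num_procs)}
    for j in range(n_points // 2):         # N/2 butterflies per radix-2 stage
        plan[j % num_procs].append(j)      # round-robin (an assumed policy)
    return plan


def fused_iterations_per_processor(n_points, num_procs):
    """Fused-loop iteration count: (N/2 * (log2 N - 2)) / number of processors."""
    stages = n_points.bit_length() - 1
    return (n_points // 2) * (stages - 2) // num_procs
```

For the 8-point case, 2 processors each take two butterflies per stage and 4 processors each take one. For N = 1024 the per-processor iteration count falls from 4096 to 2048 to 1024 as the processor count goes from 1 to 2 to 4.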

In the present implementation of the invention each processor comprises one or more ALUs
(Arithmetic Logic Units), multiplier units, a data cache, and load/store units. All processors share a
common instruction cache and multi-port memory. Very Long Instruction Word (VLIW) processors
are representative of such architectures and can be used for meeting these requirements.

It will be apparent to those with ordinary skill in the art that the foregoing is merely illustrative
and not intended to be exhaustive or limiting, having been presented by way of example only, and that
various modifications can be made within the scope of the above invention.

Accordingly, this invention is not to be considered limited to the specific examples chosen for
purposes of disclosure, but rather to cover all changes and modifications, which do not constitute 
departures from the permissible scope of the present invention. The invention is therefore not 
limited by the description contained herein or by the drawings, but only by the claims. 




02-IND-120 
We claim: 

1. A linear scalable method for computing a Fast Fourier Transform (FFT) or Inverse Fast 
Fourier transform (IFFT) in a multiprocessing system using a decimation in time 
approach, comprising the steps of: 

computing the first two stages of an N-point FFT/IFFT as a single radix-4
butterfly computation operation while implementing the remaining (log2N-2)
stages as radix-2 operations,

fusing the 3 main nested loops of each radix-2 butterfly stage into a
single radix-2 butterfly computation loop, and

distributing the computation of the butterflies in each stage such that each
processor computes an equal number of complete butterfly calculations thereby
eliminating data interdependency in the stage.

2. A linear scalable method as claimed in claim 1 wherein said distribution of butterfly
computation is implemented by assigning the memory location addresses corresponding
to the inputs and outputs required for each specific butterfly calculation to a selected
processor.

3. A linear scalable system for computing a Fast Fourier Transform (FFT) or Inverse Fast 
Fourier transform (IFFT) in a multiprocessing system using a decimation in time 
approach, comprising: 

means for computing the first two stages of an N-point FFT/IFFT as a single 
radix-4 butterfly computation operation while implementing the remaining 
(log2N-2) stages as radix-2 operations, 

means for fusing the 3 main nested loops of each radix-2 butterfly stage into a 
single radix-2 butterfly computation loop, and 

means for distributing the computation of the butterflies in each stage such that 
each processor computes an equal number of complete butterfly calculations 
thereby eliminating data interdependency in the stage.

4. A linear scalable system as claimed in claim 3 wherein said means for distributing the
computation of the butterflies is implemented by means for assigning the memory
location addresses corresponding to the inputs and outputs required for specific butterfly
calculations to the selected processor.

5. A computer program product comprising computer readable program code stored on a 
computer readable storage medium embodied therein for computing a Fast Fourier 
Transform (FFT) or Inverse Fast Fourier transform (IFFT) in a multiprocessing system 
using a decimation in time approach, comprising: 

computer readable program code means configured for computing the first two 
stages of an N-point FFT/IFFT as a single radix-4 butterfly computation operation 
while implementing the remaining (log2N-2) stages as radix-2 operations,

computer readable program code means configured for fusing the 3 main nested
loops of each radix-2 butterfly stage into a single radix-2 butterfly computation
loop, and

computer readable program code means configured for distributing the
computation of the butterflies in each stage such that each processor computes an 
equal number of complete butterfly calculations thereby eliminating data 
interdependency in the stage. 

6. The computer program product as claimed in claim 5 wherein said computer readable
program code means configured for distributing the computation of the butterflies is
implemented by computer readable program code means configured for assigning the
memory location addresses corresponding to the inputs and outputs required for specific
butterfly calculations to a selected processor.

7. A linear scalable method for computing a Fast Fourier Transform (FFT) or Inverse Fast 
Fourier transform (IFFT) in a multiprocessing system using a decimation in time 
approach substantially as herein described with reference to and as illustrated in figures 2 
and 3 of the accompanying drawings.

8. A linear scalable system for computing a Fast Fourier Transform (FFT) or Inverse Fast
Fourier transform (IFFT) in a multiprocessing system using a decimation in time 
approach substantially as herein described with reference to and as illustrated in figures 2 
and 3 of the accompanying drawings. 

9. A computer program product comprising computer readable program code stored on a 
computer readable storage medium embodied therein for computing a Fast Fourier 
Transform (FFT) or Inverse Fast Fourier transform (IFFT) in a multiprocessing system 
using a decimation in time approach substantially as herein described with reference to 
and as illustrated in figures 2 and 3 of the accompanying drawings. 

of ANAND & ANAND, Advocates

Agents for the Applicants



Dated this [illegible] day of [illegible], 2002

ABSTRACT

This invention relates to a linear scalable method for computing a Fast Fourier Transform (FFT)
or Inverse Fast Fourier transform (IFFT) in a multiprocessing system using a decimation in time
approach. Linear scalability means that as the number of processors increases by a factor P, the
number of computational cycles reduces by the same factor P. The invention
comprises computing the first two stages of an N-point FFT/IFFT as a single radix-4 butterfly
computation operation while implementing the remaining (log2N-2) stages as radix-2 operations,
fusing the 3 main nested loops of each radix-2 butterfly stage into a single radix-2 butterfly
computation loop, and distributing the computation of the butterflies in each stage such that each
processor computes an equal number of complete butterfly calculations thereby eliminating data
interdependency in the stage.

The invention also provides a linear scalable system and computer program product for 
computing FFT/IFFT in a multi-processor system.

Figure 1. Butterfly, the basic computational block of FFT/IFFT

Figure 2. Butterfly distribution for 2-processor configuration

Figure 3. Butterfly distribution for 4-processor configuration